A Practical Part-of-Speech Tagger
Authors: Doug Cutting and Julian
Abstract
We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and optimizations which result in high-speed operation. Three applications for tagging are described: phrase recognition; word sense disambiguation; and grammatical function assignment.

1 Desiderata

Many words are ambiguous in their part of speech. For example, "tag" can be a noun or a verb. However, when a word appears in the context of other words, the ambiguity is often reduced: in "a tag is a part-of-speech label," the word "tag" can only be a noun. A part-of-speech tagger is a system that uses context to assign parts of speech to words.

Automatic text tagging is an important first step in discovering the linguistic structure of large text corpora. Part-of-speech information facilitates higher-level analysis, such as recognizing noun phrases and other patterns in text.

For a tagger to function as a practical component in a language processing system, we believe that a tagger must be:

Robust: Text corpora contain ungrammatical constructions, isolated phrases (such as titles), and non-linguistic data (such as tables). Corpora are also likely to contain words that are unknown to the tagger. It is desirable that a tagger deal gracefully with these situations.

Efficient: If a tagger is to be used to analyze arbitrarily large corpora, it must be efficient, performing in time linear in the number of words tagged. Any training required should also be fast, enabling rapid turnaround with new corpora and new text genres.

Accurate: A tagger should attempt to assign the correct part-of-speech tag to every word encountered.

Tunable: A tagger should be able to take advantage of linguistic insights. One should be able to correct systematic errors by supplying appropriate a priori "hints." It should be possible to give different hints for different corpora.

Reusable: The effort required to retarget a tagger to new corpora, new tagsets, and new languages should be minimal.

2 Methodology

2.1 Background

Several different approaches have been used for building text taggers. Greene and Rubin used a rule-based approach in the TAGGIT program [Greene and Rubin, 1971], which was an aid in tagging the Brown corpus [Francis and Kučera, 1982]. TAGGIT disambiguated 77% of the corpus; the rest was done manually over a period of …
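To make the idea of using context to resolve an ambiguous word concrete, here is a minimal sketch of Viterbi decoding over a toy hidden Markov model, written in Python. It is an illustration under stated assumptions, not the authors' implementation: the tagset, the lexicon, every probability, the `UNKNOWN` fallback, and the function name `viterbi` are all hypothetical, and the probabilities are set by hand here rather than estimated from training text.

```python
import math

# Hypothetical toy model: tags, words, and probabilities are invented
# for illustration and are not taken from the paper.
TAGS = ["DET", "NOUN", "VERB"]

# P(first tag in the sentence)
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}

# P(next tag | previous tag)
TRANS = {
    "DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.20, "VERB": 0.70},
    "VERB": {"DET": 0.60, "NOUN": 0.30, "VERB": 0.10},
}

# Lexicon: the tags each word may take, with P(word | tag).
EMIT = {
    "a":     {"DET": 0.5},
    "tag":   {"NOUN": 0.1, "VERB": 0.1},   # ambiguous out of context
    "is":    {"VERB": 0.3},
    "label": {"NOUN": 0.1, "VERB": 0.05},
}

# Assumption: a word missing from the lexicon may take any tag,
# each with the same small probability.
UNKNOWN = 1e-6


def viterbi(words):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []

    def emit(word):
        return EMIT.get(word, {t: UNKNOWN for t in TAGS})

    # Log score of the best path ending in each tag allowed for the first word.
    scores = {t: math.log(START[t] * p) for t, p in emit(words[0]).items()}
    backpointers = []

    for word in words[1:]:
        new_scores, pointers = {}, {}
        for tag, p in emit(word).items():
            # Best previous tag for reaching `tag` at this position.
            prev = max(scores, key=lambda q: scores[q] + math.log(TRANS[q][tag]))
            new_scores[tag] = scores[prev] + math.log(TRANS[prev][tag]) + math.log(p)
            pointers[tag] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag to recover the path.
    tag = max(scores, key=scores.get)
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path


print(viterbi(["a", "tag", "is", "a", "label"]))
```

In this toy model "tag" is equally likely to be a noun or a verb in isolation, and the preceding determiner is what tips the decision toward the noun reading. The work done per word depends only on the size of the tagset, so the decoding time grows linearly with the number of words.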
Publication date: 1992